Brief Summary of Data
And Data Source
This is a personal project. The motivation for this project is to
learn new bioinformatics skills related to viral research and expand my
knowledge. This project analyzes CMV (Human betaherpesvirus 5) genomic
data related to congenital infections. Data analysis comprises finding
drug resistance mutations, CMV genotyping, and phylogenetic analysis of
samples.
Raw sequence data used in this project has been previously shared in
NCBI-SRA by the University of Glasgow (HCMV_Congenital_Collection) under
the BioProject Accession Number PRJEB48078 (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB48078).
This BioProject contains 3 Genomic DNA submitted by Salvatore Camiolo
(University of Glasgow, School of Infection & Immunity). Instrument
used was Illumina MiSeq in PRJEB48078. And researchers used hybrid
selection as a selection method.
All bash and R-scripts and python notebooks related to the current
project have been shared in the GitHub repository https://github.com/osmanmerdan/Congenital-CMV. Tool
parameters have been explained in detail in scripts files.
This project does not aim to find optimal best practices to
analyze CMV NGS data. For readers who are interested in a user-friendly,
fast, all-in-one solution to analyze the CMV genome, please check
GRACy(Camiolo et al. (2021)). This HTML
document was created using RMarkdown. Plots were created using the
ggplot package if otherwise are not indicated.
CMV Viral
Particle
Human cytomegalovirus formally known as Human Herpes Virus 5 (HHV5)
is a member of Herpesviridae family. HHV-6, HHV-7, and CMV are
classified as betaherpesviruses.
Complete HCMV particles have a diameter of 120-200 nm. HCMV
nucleocapsid encircles large (220- to 240-kb) linear double-stranded DNA
genome. Outside of nucleocapsid there is matrix and a surrounding
phospholipid-rich envelope.
The HCMV genome is the largest among herpesviruses. It consists of
more than 170 nonoverlapping ORFs (Open Reading Frames). Genome encodes
structural, regulatory and immune modulatory proteins.
CMV is one of most important pathogenes causing congenital infections
ultimately leading sensorineural hearing loss and neurodevelopmental
problems.
HCMV DNA enters the host cell in linear form. HCMV genome has an
unpaired base at each end which facilitates circularization of the
genome which enables rolling circle replication. This replication
mechanism creates genome length covalently linked copies of viral genome
which are called concatemer. Concatemers are later cleved.
HCMV genome has class E organization with two domains that can be
inverted relative to each other (Fig 1). That organization
yields equal amounts of four genomic isomers that can be isolated from
samples. These two domains are the long and short genome segments (L and
S). Each domain contains a unique central region (UL and US). Those
regions are flanked by repeated components that reside at either the
terminal regions of the genome (TRL and TRS) or intersection of L and S
segments (IRL and IRS). General organization of HCMV genome is
TRL-UL-IRL -IRS-US-TRS. Terminal end internal repeats shares a region of
couple of hundreds bps, called a sequence.
Quality Control Of
Trimmed and Filtered NGS Data
Paired end seqeunce data of 36 samples were downloaded from SRA using
sratoolkit (3.0.0) fastq-dump. Reads were timmed with Trimmomatic (Bolger, Lohse, and Usadel (2014)) (version
0.39). The minimum sequence length was set to 200 bp. After Trimmomatic
operations, FastQC (“FastQC”
(2015)) and MultiQC (Ewels et al.
(2016)) were performed to get summary statistics about
raw-filtered NGS data.
The below table shows some of the summary statistics.
HCMV
wild-type strain Merlin genome size is 235646 bp. To
achieve reliable results from the variant analysis, at least a 20 bp
coverage threshold has been set in present NGS data analysis project.
That means at least 20000 (200bp long) reads are needed
(red line in
the Plot 1). Some of the low unique read count samples are:
HCMV Merlin strain genome has a GC content of %57. All the samples had
some level of deviation from the expected %57, which several issues like
contamination, PCR bias, low genome coverage, and hypervariable regions
in the CMV genomes may have caused (Fig 2).
Mean quality scores were sufficient enough for downstream analysis
(Fig 3).
Alignment
Results
Reads were aligned to
Human herpesvirus five strain Merlin, complete
genome (GenBank Accession: AY446894.2) using BWA-MEM (
Heng Li (2013)) (Version: 0.7.17-r1188).
Alignment files were filtered and sorted using Samtools (
Danecek et al. (2021)) (Version: 1.16.1). Output
bam files were de-duplicated using Picard MarkDuplicates (
“Picard Toolkit” (2018)).
Some important summary results from de duplicated-bam alignment files
were presented below.
Samples with mean depth lower than 20 or coverage lower than 90% were
listed below.
ERR7018474 and ERR7018433 could cover 50% of the reference genome with
depth(X)≥1
(Plot 2). Five samples (ERR7018443, ERR7018474,
ERR7021883, ERR7029111, ERR7039821) had low coverage and low depth
across the genome
(Fig 4). A large chunk of the reference
genome had depth(X)<20 for these five samples
(Plot 2). They
were not further processed.
There were three regions not covered in any sample, in the beginning, at
the end, and around the 200000th position
(Fig 4). Those
regions correspond to TRL, TRS, IRS, and IRL regions in AY446894.2,
which are listed below.
However, there were indications of possible deletions around positions
6000 and 11000
(Plot 3). Record ERR7036363 had zero depth in
those two regions
(Plot 3). ERR7032218 showed signs of deletion
only around position 11000. ERR7024258 was included for
comparison.
ERR7036363 showed significant sequence variation in the RL6 gene, as
indicated by soft-clipped reads
(Fig 5). That’s why ER7036363
had zero reads mapped at corresponding positions. ERR7032218 has some
SNPs and a deletion in the RL6 gene
(Fig 5). ERR7024258 has
nucleotide content very similar to the Merlin strain at corresponding
parts
(Fig 5).
ERR7036363 and ERR7032218 showed significant sequence variation with
signs of possible deletion in regions corresponding to the genes RL11,
RL12, RL13, and UL1
(Fig 6). In addition, ERR7024258 has many
sequence variations compared to the Merlin strain in these gene regions
(Fig 6).
RL6, RL12, RL13, and UL1 genes are members of the RL11 gene family.
These regions are known for showing sequence variations between clinical
isolates. (Sijmons, Van Ranst, and Maes
(2014), Dolan et al. (2004))
In the
Fig4 positions between 180800 and 181300, there is a
uniform drop in the the number of reads mapped across the samples.
However, some samples had a second drop in the number of reads mapped
around position 186800. ERR7025502 and ERR7039855 had zero mapped reads
in the region neighboring position 187500
(Plot 4). ERR7025502,
ERR7039855 and ERR7019208 had zero reads mapped around position 181000
(Plot 4).
UL146 gene has been shown to be particularly variable.(
Dolan et al. (2004))
(Fig 7).
ERR7025502, ERR7039855 and ERR7019208 had soft clipped reads in UL146
region indicating large amount of sequence variation between samples and
reference Merlin strain
(Fig 7).
Another region showing sequence variation between different HCMV
clinical isolates is the UL139 region (
Qi et al.
(2006)). The alignment status of reads belonging to ERR7025502,
and ERR7039855 showed sequence variation in UL139-150A regions
(Fig
8).
In summary, CMV has a lot of regions that pose high levels of sequence
variation between clinical strains. Those regions could be rendered as
low-coverage, low-depth regions during alignment process.
Variants
Variants were called on deduplicated-bam files and filtered using
bcftools (version 1.16).(
H. Li (2011)).
Later SnpEff database for AY446894.2 was build following instructions
shared on SnpEff website. (
http://pcingola.github.io/SnpEff/se_faq/#how-to-building-an-ncbi-genome-genbank-file)
Variants in coding regions were annotated using SnpEff (
Cingolani, Platts, et al. (2012)). Annotated
variants were extracted with SnpSift (
Cingolani,
Patel, et al. (2012)). Number of different annotations were
summarized below table.
Note that beacuse the of overlapping genes
and regions below numbers are not exact representation of number of
variants. During the variant extraction process each variant extracted
seperate rows to enable quick downstream pprocess.
Some samples had a high number of non-synonymous SNPs at regions RL12
and RL13
(Plot 5). However, nearly half of the samples had zero
non-synonymous SNP in RL12 and RL13
(Plot 5). RL12 and RL13
bind Fc portion of human IgG1 and IgG2.(
Cortese
et al. (2012)) and they show significant number of sequence
variations between clinical strains. Manually inspecting alignments
files for the R12 and RL13 genes revealed that some samples
(e.g.ERR7018455) had no coverage in the region which lead zero detected
sequence variation
(Fig 9, Fig 10). Hypervariable regions in
HCMV genome leads to complicated alignments which can significantly
alter the sequence variation results.
A higher number of non-synonymous SNPs were observed in UL8, UL74, UL75,
and UL150 genes
(Plot 5, Plot 6). The products of these genes
were summarized in the following table.
UL74, and UL75 are involved in cell entry (Ye et
al. (2020)). UL74 and UL75 codes for envelope glycoproteins O and
H, respectively. UL74 is related to cell tropism and immunomodulation
(Sijmons, Van Ranst, and Maes (2014)).
UL74 region is also a hypervariable region. UL150 is involved in immune
regulation which can explain high number of non-synonymous SNPs.
Membrane glycoprotein 8 is the product of UL8. Each sample had 30 or
more non-synonymous SNPs in the UL8 gene
(Plot 6). UL8 is shown
to be able to reduce the production of pro-inflammatory cytokines
significantly. UL7 and UL8 share 195 amino acids (aa) long sequence,
which includes ~35 aa signal peptide and ~100 aa IgV-like domain (
Pérez-Carmona et al. (2018)). Previous studies
have shown that. HCMV strains has %79-%100 sequence identity and most of
the sequence variation in the UL8 gene has been observed in the stalk
region which is ~140 aa long. Similar results obtained from the variant
analysis
(Plot 7). 3’ end of UL8 coding region showed high
number of sequence variations
(Fig 11).
Drug Resistance
Mutations
Main drugs used to treat CMV infections are Ganciclovir, Foscarnet,
Cidofovir, Letermovir and Maribravir. UL54 (DNA polymerase) mutations
can alter Ganciclovir, Foscarnet and Cidofovir succeptibility. UL97
kinase mutations effect Ganciclovir and Maribavir succeptibility.
Maribavir and Letermovir resistance can arise because of the mutations
in UL27 or Terminase gene (UL56, UL89, UL51) respectively.
Annotated-VCF files were searched for the presence of known antiviral
resistance-related mutations using the Hrpesvirus Drug Resistance
Genotyping web server (
http://cmv-resistance.ucl.ac.uk/herpesdrg/). No
antimicrobial resistance mutation has been found in any sample. One
exciting feature is that most samples have Gln126Leu and Asp605Glu amino
acid polymorphisms in the UL97 translated sequence _(Plot8__. Below
table summarizes detected polymorphisms by the server. Asp605Glu
polymorphism previously shown to be more prevalant in infants with
primary HCMV infection compared with HCMV recurrence in transplant
recipients (
Sun et al. (2012)).
No previous literature record was found for Thr75Ala amino acid
variation in UL97 translated sequence. However, it was present in 31
samples
(Plot8). Other mutations in UL97 were already mentioned
in the literature and are frequently found in clinical samples. (
Lurain and Chou (2010)) Characterized
No previous literature record was found for Gln339Arg and Thr1122Ile
amino acid changes in UL54 translated sequence
(Plot 9). Number
of samples has two SNPs (HGVS_C notation c.3365C>T, c.3364A>G)
(Ref Pos 78558 G->A and Ref Pos 78559 T -> C) affecting codon
corresponding to amino acid at position 1122 in UL54 protein
(Fig 12
and Fig 13). Those two SNPs were pictured as two different amino
acid changes (Thr1122Ile and Thr1122Ala) in the annotated vcf files. But
combined effect of two SNPs resulted in a codon change from ACC to GTC
which actually caused Thr1122Val. There were also some samples with
neither of mutations (Thr1122Ile and Thr1122Ala).
Gln339Arg was found in five samples. Remaining mutations were
characterized by other researchers.(
Lurain and
Chou (2010))
There was no resistance related mutations in UL56, UL51, UL89, UL27
genes. Some missense variations were present in nearly all samples
(Plot 10).
Genotyping
The stain numbers were estimated following previously describes steps
(Suárez et al. (2019), Govender et al. (2022)) using the miRNA_Search
program (https://github.com/centre-for-virus-research/VATK/tree/master/HCMV_pipeline).
This program searches the presence of short motifs in the sequence data.
Previously listed short motifs has been used (Govender et al. (2022), Supplementary Tables).
However, to minimise biases arised from sequence duplciates FastUniq
used to remove duplicate sequences from sequencing reads (Xu et al. (2012)). At this step FASTQ files was
turned into FASTA format and later sequence identifiers removed before
processing files with miRNA_Search. Output files were parsed using a
modified version of perl script shared in HCMV_pipeline. Minimum number
of reads that carry one specific motif was set to 10 and the percetange
treshold for number of reads that carry any motif was set to 2%. Any
genotype motif that did not meet with this requirements was discarded.
Motifs representing RL5A, RL6, UL74, RL12, UL11, UL139, UL111A, UL9,
UL146, UL120, UL73, US9, RL13, UL1 regions were searched in the
reads.
There was no sample which has more than one strain. Only one genotype
has been detected for each sample - variable gene
combinations(genotype_counts.txt file). Genes, genotypes, number of
reads carry that genotype motif and the percetage of reads were
summarized in table below.
RL5A-G1, ULL11A-WT, and US9-WT were common elements in all samples. Two
different genotypes were spotted in UL9, UL11, UL1, RL6 regions
(Plot11). Other regions were little bit more diverse regarding
genotypes.
Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014.
“Trimmomatic: A Flexible Trimmer for Illumina
Sequence Data.” Bioinformatics 30 (15): 2114–20.
https://doi.org/10.1093/bioinformatics/btu170.
Camiolo, Salvatore, Nicolás M Suárez, Antonia Chalka, Cristina
Venturini, Judith Breuer, and Andrew J Davison. 2021.
“GRACy: A Tool for Analysing Human
Cytomegalovirus Sequence Data.” Virus Evolution 7 (1):
veaa099.
https://doi.org/10.1093/ve/veaa099.
Cingolani, P., V. M. Patel, M. Coon, T. Nguyen, S. J. Land, D. M. Ruden,
and X. Lu. 2012. “Using Drosophila Melanogaster as a Model for
Genotoxic Chemical Mutational Studies with a New Program,
SnpSift.” Frontiers in Genetics 3.
Cingolani, P., A. Platts, M. Coon, T. Nguyen, L. Wang, S. J. Land, X.
Lu, and D. M. Ruden. 2012. “A Program for Annotating and
Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs
in the Genome of Drosophila Melanogaster Strain W1118; Iso-2;
Iso-3.” Fly 6 (2): 80–92.
Cortese, Mirko, Stefano Calò, Romina D’Aurizio, Anders Lilja, Nicola
Pacchiani, and Marcello Merola. 2012.
“Recombinant
Human Cytomegalovirus (HCMV)
RL13 Binds Human
Immunoglobulin G Fc.”
Edited by Michael Nevels.
PLoS ONE 7 (11): e50166.
https://doi.org/10.1371/journal.pone.0050166.
Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu
Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021.
“Twelve years of SAMtools and BCFtools.”
GigaScience 10 (2).
https://doi.org/10.1093/gigascience/giab008.
Dolan, Aidan, Charles Cunningham, Ralph D. Hector, Aycan F.
Hassan-Walker, Lydia Lee, Clare Addison, Derrick J. Dargan, et al. 2004.
“Genetic Content of Wild-Type Human Cytomegalovirus.”
Journal of General Virology 85 (5): 1301–12.
https://doi.org/10.1099/vir.0.79888-0.
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016.
“MultiQC: Summarize Analysis Results for Multiple Tools and
Samples in a Single Report.” Bioinformatics 32 (19):
3047.
https://doi.org/10.1093/bioinformatics/btw354.
Govender, Kerusha, Raveen Parboosing, Salvatore Camiolo, Petr Hubáček,
Irene Görzer, Elisabeth Puchhammer-Stöckl, and Nicolás M. Suárez. 2022.
“Complexity of Human Cytomegalovirus
Infection in South African
HIV-Exposed Infants with
Pneumonia.” Viruses 14 (5): 855.
https://doi.org/10.3390/v14050855.
Li, H. 2011.
“A Statistical Framework for SNP
Calling, Mutation Discovery, Association Mapping and Population
Genetical Parameter Estimation from Sequencing Data.”
Bioinformatics 27 (21): 2987–93.
https://doi.org/10.1093/bioinformatics/btr509.
Li, Heng. 2013.
“Aligning Sequence Reads, Clone Sequences and
Assembly Contigs with BWA-MEM.” https://doi.org/10.48550/ARXIV.1303.3997.
Lurain, Nell S., and Sunwen Chou. 2010.
“Antiviral
Drug Resistance of Human
Cytomegalovirus.” Clinical Microbiology
Reviews 23 (4): 689–712.
https://doi.org/10.1128/CMR.00009-10.
Pérez-Carmona, Natàlia, Pablo Martínez-Vicente, Domènec Farré, Ildar
Gabaev, Martin Messerle, Pablo Engel, and Ana Angulo. 2018.
“A
Prominent Role of the Human
Cytomegalovirus UL8 Glycoprotein
in Restraining Proinflammatory
Cytokine Production by Myeloid
Cells at Late Times During
Infection.” Edited by Richard M. Longnecker.
Journal of Virology 92 (9): e02229–17.
https://doi.org/10.1128/JVI.02229-17.
Qi, Ying, Zhi-Qin Mao, Qiang Ruan, Rong He, Yan-Ping Ma, Zheng-Rong Sun,
Yao-Hua Ji, and Yujing Huang. 2006.
“Human Cytomegalovirus
(HCMV) UL139 Open Reading Frame:
Sequence Variants Are Clustered into Three Major
Genotypes.” Journal of Medical Virology 78 (4): 517–22.
https://doi.org/10.1002/jmv.20571.
Sijmons, Steven, Marc Van Ranst, and Piet Maes. 2014.
“Genomic and
Functional Characteristics of
Human Cytomegalovirus Revealed by
Next-Generation
Sequencing.” Viruses 6 (3): 1049–72.
https://doi.org/10.3390/v6031049.
Suárez, Nicolás M, Gavin S Wilkie, Elias Hage, Salvatore Camiolo,
Marylouisa Holton, Joseph Hughes, Maha Maabar, et al. 2019.
“Human
Cytomegalovirus Genomes Sequenced
Directly From Clinical
Material: Variation,
Multiple-Strain Infection,
Recombination, and Gene
Loss.” The Journal of Infectious Diseases
220 (5): 781–91.
https://doi.org/10.1093/infdis/jiz208.
Sun, J., G. Cao, L. Zhang, Y. Zhang, Z. Zhao, J. Hu, and Y. Ji. 2012.
“Human Cytomegalovirus (CMV)
UL97 D605E Mutation
Has a Higher Prevalence in
Infants With Primary
CMV Infection Compared
With Transplant Recipients
With CMV Recurrence.”
Transplantation Proceedings 44 (10): 3022–25.
https://doi.org/10.1016/j.transproceed.2012.06.069.
Xu, Haibin, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui
Qian, Jinhui Chen, and Shilin Chen. 2012.
“FastUniq:
A Fast De Novo
Duplicates Removal Tool for
Paired Short Reads.”
Edited by Daniel Doucet.
PLoS ONE 7 (12): e52249.
https://doi.org/10.1371/journal.pone.0052249.
Ye, Lele, Yunyun Qian, Weijie Yu, Gangqiang Guo, Hong Wang, and
Xiangyang Xue. 2020.
“Functional Profile of
Human Cytomegalovirus Genes and
Their Associated Diseases:
A Review.” Frontiers in
Microbiology 11 (September): 2104.
https://doi.org/10.3389/fmicb.2020.02104.